The GENIA project: corpus-based knowledge acquisition and information extraction from genome research papers
Abstract
We present an outline of the genome information acquisition (GENIA) project for automatically extracting biochemical information from journal papers and abstracts. GENIA will be available over the Internet and is designed to aid in information extraction, retrieval and visualisation and to help reduce information overload on researchers. The vast repository of papers available online in databases such as MEDLINE is a natural environment in which to develop language engineering methods and tools and is an opportunity to show how language engineering can play a key role on the Internet.
Similar resources
The GENIA Project: Knowledge Acquisition from Biology Texts
Overview of Project: The GENIA project [9] (Fig. 1) seeks to automatically extract useful information from texts written by scientists to help overcome the problems caused by information overload. We intend that while the methods are customized for application in the microbiology domain, the basic methods should be generalisable to knowledge acquisition in other scientific and engineering domain...
Steps towards a GENIA Dependency Treebank
In this paper we describe on-going work aimed at creating a dependency-based annotated treebank for the BioMedical domain. Our starting point is the GENIA corpus [14], which is a corpus of 2000 MEDLINE abstracts, which has been manually annotated for various biological entities, according to the GENIA Ontology. There is an exponential growth of published research in this sector, which makes it...
The GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain
With the information overload in the genome-related field, there is an increasing need for natural language processing technology to extract information from literature, and various attempts at information extraction using NLP have been made. We are developing the necessary resources, including a domain ontology and an annotated corpus from research abstracts in the MEDLINE database (GENIA corpus). We are ...
Use of OWL 2 to Facilitate a Biomedical Knowledge Base Extracted from the GENIA Corpus
The annotation of the GENIA corpus, a set of biomedical articles, targets the classification of biological entities based on their association with a domain-tailored taxonomy of categories. By applying an information extraction process to the corpus, we have developed a knowledge base (KB) that includes a more comprehensive taxonomy of categories, relationships between biological entities, and...
Meta-Knowledge Annotation at the Event Level: Comparison between Abstracts and Full Papers
Biomedical literature contains rich information about events of biological relevance. Event corpora, containing classified, structured representations of important facts and findings contained within text, provide an important resource for the training of domain-specific information extraction (IE) systems. Such corpora pay little attention to the interpretation of events, e.g., whether an even...